Abstract


Recently we have witnessed the advent of general purpose data base
management systems and important advances in computer networks. The
combination of the two technologies to produce distributed data base
management systems should be the next significant step in commercial systems
development. A completely generalized distributed data base management
system would reside on a heterogeneous computer network with different
data base systems available at various processors. Communication and data
transfer would be possible between any nodes in the network. The
realization of this goal is still several years in the future. However,
considerable progress in the area of distributed data base systems has been
made in both academic and industrial environments.

This  report  describes  the  principal  problem  areas  in  distributed 
data  base  management  system  development.  Distributed  data  base  systems 
share  many  design  problems  with  both  single  machine  data  base  systems  and 
computing  networks,  as  well  as  introducing  several  new  dilemmas. 

Recent  research  in  these  problem  areas  is  presented  to  provide  a 
picture  of  the  state  of  the  art  of  distributed  data  base  development. 

In  addition,  the  current  status  of  the  data  base  industry  with  respect  to 
distributed  processing  is  evaluated  by  reporting  the  current  projects  and 
future  plans  of  selected  (anonymous)  data  base  vendors. 


INTRODUCTION 


Recently,  we  have  witnessed  the  advent  of  general  purpose  data  base 
management  systems  (DBMS)  and  important  advances  in  computer  networks. 

The  combination  of  the  two  technologies  to  produce  distributed  data 
base  management  systems  should  be  the  next  significant  step  in  commercial 
systems  development.  A  completely  generalized  distributed  data  base 
management  system  would  reside  on  a  heterogeneous  computer  network  with 
different data base systems available at various processors.
Communication and data transfer would be possible between any nodes in the
network. The realization of this goal is still several years in the
future. However, considerable progress in both academic and industrial
environments has been made in the area of distributed data base systems.

This  report  describes  the  principal  problem  areas  in  distributed 
DBMS  development.  As  indicated  by  Fry  and  Sibley*,  distributed  data 
base  systems  share  many  design  problems  with  both  single  machine  data 
base  systems  and  computing  networks  as  well  as  introducing  several  new 
dilemmas.

Recent  research  in  these  problem  areas  is  presented  to  provide  a 
picture  of  the  state  of  the  art  of  distributed  data  base  development. 

In  addition,  the  current  status  of  the  data  base  industry  with  respect  to 
distributed  processing  is  evaluated  by  reporting  the  current  projects 
and  future  plans  of  selected  (anonymous)  data  base  vendors. 

Distributed  Data  Base  Taxonomy 

Development of a completely general distributed data base management
system must evolve through several less complex forms. This
evolutionary process is currently in progress. The beginning point of
the development is with data base management systems targeted for a single,
general purpose computer. The development of such systems is among the
most significant events in computer science. For a detailed treatment
of the evolution of data base systems, see the work of Fry and Sibley.*

In  order  to  properly  classify  distributed  DBMS  research  and 
development,  several  stages  in  the  evolution  of  a  generalized  distributed 
DBMS  are  listed  along  with  a  brief  discussion  of  their  current  status. 

1.  DBMS  for  a  single  general  purpose  machine  -  many  such  systems 
are  commercially  available. 

2. Data Base machines - special purpose processors whose function
is data management.

3. Back-End DBMS - a network of two or more machines in which one of
the processors is dedicated to performing the data base management
function. The dedicated data base processor is known as the back-end
machine [3]. The back-end machine may be a general purpose computer or
contain specialized hardware or firmware.

4. Special purpose distributed DBMS - several special purpose data
base systems have been implemented on computer networks. Medical information
networks have been one of the major areas of emphasis for distributed
data systems. Airline reservation systems [6] can also be classified
in this area.

5.  Single  software  DBMS  on  a  homogeneous  network  -  this  is  the 
next  step  to  be  realized  in  distributed  DBMS  development.  Many  hardware 
manufacturers  provide  facilities  for  communication  between  their  own 
machines.  The  data  base  software  must  be  enhanced  to  allow  tasks  residing 
on  different  processors  to  communicate.  The  problems  of  file  allocation, 
privacy, and deadlock become quite complex at this level.

6. Single software DBMS on a heterogeneous network - by allowing
differing brands of processors into the network, compatibility problems
are introduced. Communication protocols and data conversion schemes
are necessary for the extension to heterogeneous networks.

7. Multiple software data base systems on a heterogeneous network -
the most general distributed DBMS is one which accommodates users with
different data base software systems as well as hardware obtained from
multiple vendors. In addition to containing all the problems of the
previously mentioned systems, the difficult task of data base translation
is introduced. Data base translation entails structural as well as code
conversion.
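
The two kinds of conversion named in item 7 can be illustrated with a small Python sketch. It first performs a code conversion (EBCDIC text to ASCII, using Python's standard `cp037` codec as a stand-in for one vendor's character set) and then a structural conversion (reordering the fields of a fixed-format record and changing the byte order of an integer). The record layout and the function names are hypothetical and are not drawn from any system discussed in this report.

```python
import struct


def ebcdic_to_ascii(raw: bytes) -> bytes:
    """Code conversion: re-encode EBCDIC (code page 037) text as ASCII."""
    return raw.decode("cp037").encode("ascii")


def convert_record(source: bytes) -> bytes:
    """Structural conversion of one hypothetical employee record.

    Source layout: 8-byte EBCDIC name, then a big-endian 32-bit id.
    Target layout: little-endian 32-bit id, then an 8-byte ASCII name.
    """
    name, emp_id = struct.unpack(">8sI", source)
    return struct.pack("<I8s", emp_id, ebcdic_to_ascii(name))


# Build a source record as the hypothetical EBCDIC machine would store it.
source = "SMITH   ".encode("cp037") + struct.pack(">I", 1042)
target = convert_record(source)
print(target)
```

Even in this tiny case the two steps are distinct: the character codes change without moving any field, and the field order and byte order change without regard to the character set.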

The  current  status  of  distributed  data  base  systems  reflects  the 
classical  relationship  of  research,  development,  and  production.  Research 
is  ongoing  in  all  of  the  problem  areas  mentioned  in  the  distributed  DBMS 
breakdown.  Progress  in  the  various  areas  is  described  in  the  following 
sections  of  this  report. 

DATA  BASE  MACHINES  AND  BACK-END  PROCESSORS 

The  terms  "data  base  machine"  and  "back-end  processor"  have  recently 
been  added  to  the  computer  lexicon.  Since  they  have  evolved  independently, 
there  is  some  overlap  in  their  accepted  definitions.  A  data  base 
machine  is  a  special  purpose  computer  dedicated  to  data  base  management. 

A  back-end  processor  is  a  computer  that  performs  the  data  management 
function  for  one  or  more  different  computers.  Based  upon  these  definitions, 
it  is  possible  that  a  data  base  machine  be  used  as  a  back-end  processor. 

Let  us  first  observe  the  recent  developments  in  data  base  machines. 

One of the most common operations performed in a DBMS is the location of
data given some key. A natural method for constructing an efficient
DBMS process is to optimize the data look-up operation. Consequently,
associative processors have been proposed for usage as data base
machines [7-10].

DeFiore and Berra [8] show that the parallel search capabilities
of an associative DBMS allow it to outperform an inverted list structure
in an inquiry environment. They have developed a data management
system using associative memory at Rome Air Development Center.

This  system  uses  the  associative  processor  as  a  small,  fast  memory  buffer. 
The  data  base  is  stored  on  conventional  secondary  storage  devices  and 
paged into the associative memory. The associative memory allows searches
on the order of microseconds, but the page transfer is on the order of
milliseconds. The CASSM project [9] at the University of Florida developed
the  concept  of  a  context-addressable  disc  system  capable  of  performing 
data  manipulations  in  secondary  memory  independent  of  the  CPU.  By 
storing  the  data  base  on  the  associative  machine,  the  excessive  loading 
delay  is  eliminated.  The  approach  taken  in  the  CASSM  project  has 
been followed at the University of Toronto by Ozkarahan, Schuster,
and Smith in the design of an associative processor, RAP, whose
instruction set is a high-level relational data base language. By
processing  data  operations  at  the  machine  level,  RAP  has  the  potential 
of  overcoming  the  execution  time  problems  that  have  hampered  relational 
data base systems. RAP is designed to operate upon data bases of up to
100M  bits.  This  capacity  can  be  extended  by  conventional  mass  storage 
devices. 
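
The gap between microsecond searches and millisecond page transfers can be made concrete with a short calculation. The sketch below is only an illustration: the page count and the timing figures in the usage lines are hypothetical values chosen to match the orders of magnitude mentioned above, not measurements from any of the systems described.

```python
def scan_time_ms(pages: int, transfer_ms: float, search_us: float) -> float:
    """Time to page an entire data base through the associative memory."""
    return pages * (transfer_ms + search_us / 1000.0)


def transfer_share(transfer_ms: float, search_us: float) -> float:
    """Fraction of the time spent per page that goes to the page transfer."""
    return transfer_ms / (transfer_ms + search_us / 1000.0)


# Hypothetical figures: 1000 pages, 10 ms per transfer, 10 microseconds per search.
print(scan_time_ms(1000, transfer_ms=10.0, search_us=10.0))
print(transfer_share(transfer_ms=10.0, search_us=10.0))
```

Under these assumed figures nearly all of the elapsed time is page transfer, which is the loading delay that storing the data base on the associative device itself is meant to remove.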

The concept of a processor with data base primitives is also
under investigation by Anderson, who has proposed placing the data
base functions in firmware. The microprogrammable data base processor is
proposed as a back-end machine by Anderson. This processor could
either be directly connected to a single general purpose processor or
act as a specialized data base processing node in a network.

The Datacomputer [12] has been developed by Computer Corporation of
America as a special purpose data base processor for use in heterogeneous
networks. The Datacomputer in its current version is a PDP-10
that provides data base service in the Arpanet. The data base management
system, which uses the inverted file structure concept, is strictly a
software implementation. All requests for data base operations are
transmitted to the Datacomputer in Datalanguage, which is a high-level
data base language. Since the Datacomputer interfaces with a variety
of machine types, it must perform translations between various physical
data representations and logical data structures. Because of its
ability to communicate with heterogeneous machines, the Datacomputer
as implemented in the Arpanet is the closest existing approximation to a
completely distributed data base system. The necessary extensions are
the geographic distribution of data, which could be accomplished by
placing Datacomputers at several points in the network, and the
incorporation of different software data base systems into the network. The
communication between different data base systems remains a major hurdle
to a generalized, distributed DBMS.

The INFOPLEX system [13], which is presently under design, is intended to
be a highly parallel information management system. INFOPLEX is similar
in terms of organization to CASSM and RAP in that the data base management
function is performed by a complex of microprocessors. Each high-level
data base request is functionally decomposed and executed in parallel by the
microprocessor complex. The concept of decomposition of data base
requests could be applied at a higher level to a network of minicomputer
data base machines which share the management of a data base.

The single host, single back-end approach has been implemented as
a prototype by Canaday, et al. [3] at Bell Telephone Laboratories. The
XDMS consisted of a UNIVAC 1108 host which executed the application
programs and a META-4 back-end which performed the data base functions.
The software system used was DMS-1100, which is a derivative of the
CODASYL specifications. XDMS was the first working back-end
system and has provided impetus for considerable future work in this area.


Benefits  of  Back-End  DBMS 

A  back-end  DBMS  can  serve  as  the  nucleus  of  a  distributed  data 
base  system.  Based  upon  work  presently  under  way  in  the  data  base 
industry,  the  author  can  project  the  emergence  of  a  back-end  computer 
as  a  significant  product  in  the  near  future. 

The  first  versions  of  the  back-end  computers  will  be  general 
purpose  minicomputers  which  contain  software  data  base  systems. 

The  back-end  machines  will  interface  with  IBM  360/370's  (or  similar 
mainframes)  via  conventional  linkages.  Later  versions  will  include 
special  purpose  data  base  machines  and  more  general  and  higher  speed 
machine  interconnections.  These  projections  are  made  based  upon  an 
analysis  of  the  recent  research  advances  outlined  in  the  previous 
section  and  discussions  with  individuals  in  the  data  base  industries. 

An  evaluation  of  the  current  state  of  the  industry  appears  in  a 
later  section  of  this  paper. 

The  feasibility  of  a  back-end  DBMS  in  a  data  processing  environment 
has been the subject of several reports. These studies indicate
that  a  back-end  DBMS  can  provide  benefits  in  a  wide  range  of  areas. 

The  effects  of  a  back-end  DBMS  are  briefly  considered  here. 


all of the machine's resources. Therefore, additional computing power
becomes available at a relatively low price.

e.  Modularity  -  The  basic  back-end  configuration  forms  a  well- 
structured  compact  unit.  New  machines  can  be  added  easily  to  this 
configuration  either  in  a  multi-processor  back-end  arrangement  or 
as  a  stand  alone  computer. 

ORGANIZATION  OF  DISTRIBUTED  DATA  BASE  SYSTEMS 

There have been several alternative organizations proposed for
distributed data base systems. The function of the system, the
geographic distribution of the data, and the philosophy of the designers
all influence the organization of the distributed system. In all
cases, the distributed DBMS resides on a computer network. In the
next section the characteristics of the interface between the network
and data base software are discussed.

Several approaches have been used in the description of distributed
data base organization. Aschim classifies data bases
according to the geographical distribution of data bases and
directories. Data bases and directories may either be centralized or
distributed under this scheme. Figures 1 and 2 illustrate the
situations of distributed data bases with a centralized directory and
with a distributed directory, respectively. Aschim also describes the benefits
and problems of single and multiple software systems controlling data
base operations.

Booth [22] classifies distributed data bases by the amount of
redundancy in the data base. She describes partitioned data bases
as a logical data base spread across several computers and replicated
data bases as those in which portions of the data base are replicated at
different nodes in the network. Data bases may be partitioned
based upon accessibility; that is, by locating files at the machine
at which they are most likely to receive the heaviest usage. This
technique reduces the amount of intermachine communication, which can
become the limiting factor in distributed data base performance [20].
Booth [22] also describes a vertical partitioning technique (see Figure 3)
in which a large central data base is supplemented by several remote
data bases.
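
Partitioning by accessibility can be sketched in a few lines of Python. This is an illustration only: the access counts, node names, and the functions `partition_by_usage` and `remote_accesses` are hypothetical and are not taken from Booth's papers.

```python
from collections import defaultdict


def partition_by_usage(access_counts):
    """Place each file at the node that accesses it most often.

    access_counts maps (node, file) -> number of accesses observed.
    Returns a dict mapping file -> chosen node.
    """
    per_file = defaultdict(dict)
    for (node, file), count in access_counts.items():
        per_file[file][node] = count
    # Ties are broken by node name so the result is deterministic.
    return {
        file: max(sorted(nodes), key=lambda n: nodes[n])
        for file, nodes in per_file.items()
    }


def remote_accesses(access_counts, placement):
    """Count the accesses that must cross the network under a placement."""
    return sum(
        count
        for (node, file), count in access_counts.items()
        if placement[file] != node
    )


counts = {
    ("A", "payroll"): 90, ("B", "payroll"): 10,
    ("A", "inventory"): 5, ("B", "inventory"): 70,
}
placement = partition_by_usage(counts)
print(placement, remote_accesses(counts, placement))
```

The second function gives a rough measure of the intermachine communication that remains after partitioning, which is the quantity the technique tries to reduce.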

In an earlier paper [23], Booth provides an analysis of the tradeoffs
between  redundancy  and  division  of  data  in  a  distributed  network.  The 
benefits  of  redundancy  in  a  distributed  environment  are  increased 
access  to  the  multiple  copies  of  the  data,  readily  available  backup,  and 
decreased  communication  time  to  access  the  data  since  a  copy  can  be 
located  close  to  the  point  at  which  it  is  used.  The  primary  problems 
resulting  from  redundancy  are  the  cost  and  complexity  of  updating  a 
redundant  data  file  and  the  requirement  of  additional  storage  services. 
When  a  large  active  file  is  divided  among  several  back-end  processors 
in  a  distributed  system,  accessibility  of  the  data  can  be  increased. 
However,  there  is  some  control  and  communications  overhead  in  the 
distributed  situation,  as  opposed  to  the  case  of  a  single  data  file 
accessed  by  one  computer.  The  development  of  efficient  network 
software  can  minimize  this  overhead. 

COMMUNICATION  IN  A  DISTRIBUTED  DATA  BASE  NETWORK 

A distributed data base management system must be built upon a
computer networking facility. Since communication time is a critical
factor in distributed data base performance, an efficient network
communication mechanism is an essential requirement for a distributed
data base system. The complexity of the network control software is
related to the homogeneity of the machines in the network. If the
software data base systems on conversing machines are of different types,
then a sophisticated translation mechanism must be developed.

Network control software for distributed data bases has been
functionally specified by several researchers. Aschim [21] describes a
message switching environment for the communication of data base requests
between a host and back-end machine. He describes the information
that the host and back-end communication tasks must have available.

According to Aschim [21], the host communication task must have
knowledge of the following items.

1.  The  identification  of  the  back-end  task  that  will  access 
the  requested  data; 

2.  The  data  base  name  of  the  requested  data; 

3.  Translation  requirements; 

4.  The  interprocess  protocol; 

5.  The  mechanism  for  interpreting  the  response. 

Similarly, the back-end communication task must have the following
information:

1.  Type  of  message  received; 

2.  Translation  requirements; 

3.  The  mechanism  to  satisfy  the  data  base  request; 

4.  Identification  of  sending  process; 

5.  Conditions  for  sending  a  response; 

6.  Interprocess  protocol. 
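
One way to picture the information in the two lists above is as the fields of a request and a response message. The sketch below is a hypothetical rendering in Python; the class names, field names, message types, and the in-memory `store` are assumptions made for illustration and do not reproduce Aschim's specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostRequest:
    backend_task_id: str         # back-end task that will access the data
    data_base_name: str          # data base name of the requested data
    translation: Optional[str]   # translation requirement, if any
    protocol: str                # interprocess protocol in use
    sender_id: str               # lets the back-end identify the sending process
    operation: str               # type of message, e.g. "FIND"
    key: str


@dataclass
class BackendResponse:
    sender_id: str
    status: str
    payload: Optional[str] = None


def backend_task(request: HostRequest, store: dict) -> BackendResponse:
    """Satisfy one request against an in-memory stand-in for the data base."""
    records = store.get(request.data_base_name, {})
    if request.operation == "FIND":
        value = records.get(request.key)
        status = "OK" if value is not None else "NOT_FOUND"
        return BackendResponse(request.sender_id, status, value)
    return BackendResponse(request.sender_id, "UNSUPPORTED")


store = {"personnel": {"1042": "SMITH"}}
request = HostRequest("be-task-1", "personnel", None, "v1", "host-7", "FIND", "1042")
print(backend_task(request, store))
```

The host side interprets the `status` and `payload` fields of the response, which corresponds to the last item in the host's list.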

Peebles [24] proposes a separation of the network communication
and data base control functions. He specifies a network control
language for generalized intermachine communication. A translation
mechanism is contained within the network control language. A
distributed data base management system can be developed on top of the
network communication system.

The interface between a data base management system and a network
communication system is explained by Maryanski, et al. The
communication system proposed in that paper provides standard methods for
inter-task communication and a protocol for the transmission of
information among processors. In this system the translation mechanism is
considered a part of the data base software, not the network software.

Additional details of the application of the network communication
system to a distributed data base environment are given by Wallentine
and Maryanski.

The Datacomputer relies upon the standard Arpanet communication
facilities to exchange information with its host processors. All
communication occurs in Datalanguage, which can be viewed in this
context as a standardized message facility. The Datacomputer
performs all translations internally. However, the form of stored data
is determined by the host machine.

FILE  ALLOCATION  IN  DISTRIBUTED  DBMS 

One  of  the  key  decisions  to  be  made  by  a  data  base  administrator 
is  the  allocation  of  data  files  among  physical  devices.  The  ultimate 
goal  of  an  allocation  policy  is  to  equally  distribute  the  utilization 
of  the  devices.  A  file  allocation  policy  requires  information,  actual 
or  hypothesized,  concerning  file  utilization.  In  many  environments, 
data  base  behavior  is  dynamic.  Therefore,  utilization  patterns  must 
be constantly monitored in order to ensure an acceptable file distribution.


Naturally, there is a cost associated with monitoring file utilization
and reallocating files. A balance between the cost of a non-optimal
file organization and the cost of reallocation must be achieved.

When a data base is distributed over several machines, the
complexity of the file allocation problem increases. In a centralized
DBMS, poor file allocation results in heavy traffic on certain channels
and excessive waiting for the channels. In a distributed environment,
the communication cost of obtaining data from a file resident at another
network node becomes an important factor. Data must be allocated
first among back-end processors and then among the devices attached
to the back-end processors. Morgan and Levin have proposed three
parameters for describing file allocation algorithms for distributed
data base management systems. The parameters are:

1.  Level  of  data  sharing; 

2.  Behavior  of  access  patterns; 

3. Type of information available on the behavior of access
patterns.

The level of sharing indicates whether the DDBMS is partitioned,
replicated, or some combination of the two organization schemes. The
access pattern of the system may span the spectrum from inquiry only
to total update. The important factor for file allocation is whether the
access pattern may vary significantly. The final parameter is whether
the information on the behavior of the access patterns is
deterministic or probabilistic.

The standard approach in file allocation problems is to develop
a generalized cost equation and then seek a file assignment which
minimizes that equation. At the highest level, the cost function [27]
can be given as


\[ C(A_k) = Q(A_k) + U(A_k) + S(A_k) \tag{1} \]

where

$A_k$ is the $k$-th assignment;

$Q(A_k)$ is the query cost of the assignment;

$U(A_k)$ is the update cost; and

$S(A_k)$ is the storage cost.
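
To make the shape of eq. (1) concrete, the following Python sketch evaluates a cost of the form Q + U + S for every possible single-copy assignment and keeps the cheapest. The particular component costs (one unit of link cost per remote request, a flat storage charge per file) are assumptions chosen for illustration; they are not the refined cost functions of Levin or Chu, and exhaustive search is used only because the example is tiny.

```python
from itertools import product


def total_cost(assignment, query_rate, update_rate, storage_cost, link_cost):
    """C(A) = Q(A) + U(A) + S(A) for a single-copy assignment.

    assignment maps file -> node holding its only copy.
    query_rate and update_rate map (node, file) -> requests per unit time.
    storage_cost maps node -> cost of holding one file there.
    link_cost is the cost of one remote request (local requests are free).
    """
    q = sum(rate * link_cost for (node, f), rate in query_rate.items()
            if assignment[f] != node)
    u = sum(rate * link_cost for (node, f), rate in update_rate.items()
            if assignment[f] != node)
    s = sum(storage_cost[node] for node in assignment.values())
    return q + u + s


def best_assignment(files, nodes, **costs):
    """Try every assignment A_k and keep the one with the lowest cost."""
    candidates = (dict(zip(files, choice))
                  for choice in product(nodes, repeat=len(files)))
    return min(candidates, key=lambda a: total_cost(a, **costs))


costs = dict(
    query_rate={("N1", "F1"): 10, ("N2", "F1"): 2,
                ("N1", "F2"): 1, ("N2", "F2"): 8},
    update_rate={("N1", "F1"): 1, ("N2", "F2"): 1},
    storage_cost={"N1": 1, "N2": 1},
    link_cost=1,
)
best = best_assignment(["F1", "F2"], ["N1", "N2"], **costs)
print(best, total_cost(best, **costs))
```

With m files and n nodes there are n^m single-copy assignments, so exhaustive search is practical only for very small networks; this is one reason the literature turns to programming formulations and heuristics.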

As of the present time, the results obtained for file allocation
in distributed data base networks are for cases of static, deterministic
access patterns. However, Morgan and his associates are studying the
properties of networks in which access patterns vary dynamically or
are described by probability distributions.

Levin has shown that the multiple file allocation problem for
the static, deterministic case can be solved by determining the
optimal allocation for the individual files. Levin's cost function is
a refinement of eq. (1). The model developed by Chu represents
file allocation cost as the sum of storage and transmission times.
The overall structures of Levin's and Chu's models are quite close, both
involving linear programming techniques to effect a solution.

Casey studies the file allocation problem for tree networks
which are a restricted type of network topology. In a tree network,
there are no loops formed by internode paths. Casey selected tree
networks because of both their simplicity and practicality. Trees
have no routing problems and are the optimal organization in a
distributed data base with a multiple host, single back-end structure. The
most significant feature of tree networks with respect to file
allocation is that the cost equations are less complex than those
developed by Chu [29] and Levin [28] for more general network organizations.

A different computational approach to computing the optimal file
allocation is presented by Casey in another work. He uses a linear cost
model for the allocation of network resources that is similar to those used to
determine the most economical location for plants and warehouses. A
search procedure which is shown to produce an optimal allocation is
developed. Casey [31] also suggests some heuristics intended to improve
performance of the algorithm. This work, in effect, applies goal-oriented
searching techniques commonly used in artificial intelligence
work to the file allocation problem.

Chu [32] has also studied the relationship between access type and
distributed data base organization. Chu develops cost equations for
partitioned, partially replicated, and fully replicated data bases
(i.e., a copy of a file at each node which accesses the file). A
partially replicated configuration contains one copy of every data
file for each cluster in the network. A group of processors joined
together via very high speed links (memory-to-memory connections)
forms a cluster. Using a cost equation based upon communication,
storage, and translation costs, Chu reaches the intuitively appealing
result that in query mode a replicated data base provides superior
performance, while under heavy update a partitioned data base
organization is more efficient. Under the assumption that transmission cost
is higher than storage cost, Chu indicates a breakeven point of 10%
update between replicated and partitioned and 50% update between fully
and partially replicated.

The special case of a 100% query environment is treated by Ghosh
who addresses the problem of distributing a data base in order to
provide for completely parallel searching. Ghosh provides algorithms
that specify a data base distribution that allows parallel searching
from a set of queries with a given form and a specific number of
processor nodes. He considers both replicated and partitioned data base
organization. Again the tradeoff of redundancy in the data base is
considered: lower search time versus increased storage costs.
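
The direction of the replication tradeoff discussed above can be seen in a deliberately simple message-count model. The sketch below is a toy under stated assumptions (requests spread uniformly over the nodes, one unit of cost per remote message, storage and translation costs ignored). It is not Chu's or Ghosh's model, and its breakeven value depends entirely on the assumed number of nodes, so it should not be read as reproducing the percentages quoted in the text.

```python
def partitioned_cost(update_fraction: float, nodes: int) -> float:
    """Expected messages per request with a single copy of each file.

    The single copy is local with probability 1/nodes, so queries and
    updates alike go remote with probability (nodes - 1) / nodes.
    The update fraction therefore does not affect this cost.
    """
    return (nodes - 1) / nodes


def replicated_cost(update_fraction: float, nodes: int) -> float:
    """Expected messages per request with a copy at every node.

    Queries are always local; each update is sent to every other copy.
    """
    return update_fraction * (nodes - 1)


def breakeven_update_fraction(nodes: int) -> float:
    """Update fraction at which the two organizations cost the same."""
    return 1 / nodes


for p in (0.0, 0.1, 0.25, 0.5):
    print(p, partitioned_cost(p, 4), replicated_cost(p, 4))
```

Below the breakeven update fraction the replicated organization sends fewer messages; above it the partitioned organization does, which matches the qualitative result reported above.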

Thus far, the discussion on file organization has concentrated
upon the distribution of data among processor nodes of a distributed
data base network. Salasin describes a method for distributing a
data base over the storage hierarchy of a single processor. The storage
devices of the processor are ordered in terms of access speed. The
difference between the conventional storage organization and Salasin's
hierarchical approach is illustrated in Figure 4. The most significant
performance feature of this proposed storage arrangement is the buffering
of data. If data is found at level K, it is also present at all
higher levels. Salasin constructs probabilistic models which
indicate that buffering provides performance benefits for sequential,
random, and linked list file organizations.
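
The buffering property described above (data found at one level is also present at the levels beyond it) can be sketched as an inclusive storage hierarchy. The Python class below is a hypothetical illustration, not Salasin's design: it assumes level 0 is the fastest device, takes "higher levels" to mean slower and larger devices, and uses least-recently-used replacement, none of which is specified in the text.

```python
from collections import OrderedDict


class StorageHierarchy:
    """Buffers ordered fastest first, backed by a store holding everything."""

    def __init__(self, capacities, access_times, backing_store):
        self.buffers = [OrderedDict() for _ in capacities]
        self.capacities = list(capacities)
        # One access time per buffer level plus one for the backing store.
        self.access_times = list(access_times)
        self.backing_store = dict(backing_store)

    def read(self, key):
        """Return (value, time spent), copying the record into faster levels."""
        elapsed = 0
        for level, buffer in enumerate(self.buffers):
            elapsed += self.access_times[level]
            if key in buffer:
                value = buffer[key]
                buffer.move_to_end(key)  # mark as most recently used
                break
        else:
            level = len(self.buffers)
            elapsed += self.access_times[level]
            value = self.backing_store[key]
        # Buffer the record at every faster level than where it was found.
        for faster in range(level):
            self._insert(faster, key, value)
        return value, elapsed

    def _insert(self, level, key, value):
        buffer = self.buffers[level]
        buffer[key] = value
        buffer.move_to_end(key)
        if len(buffer) > self.capacities[level]:
            evicted, _ = buffer.popitem(last=False)
            # Keep the inclusion property: drop it from faster levels too.
            for faster in range(level):
                self.buffers[faster].pop(evicted, None)


hierarchy = StorageHierarchy([1, 2], [1, 10, 100], {"a": 1, "b": 2, "c": 3})
for key in ("a", "a", "b", "a", "c"):
    print(key, hierarchy.read(key))
```

Repeated reads of the same record are served from the fastest level, which is the source of the performance benefit that the probabilistic models are said to quantify.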

File  organization  has  been  studied  more  thoroughly  than  other 
subjects  related  to  distributed  data  bases.  Many  linear  programming 
methods  have  been  proposed  for  the  determination  of  optimal  file 
placement.  The  practicality  of  these  methods  for  use  in  data 
processing  environments  is  a  matter  of  conjecture.  The  main  problems 
are  that  the  algorithms  require  precise  usage  statistics  and  do  not 
consider  constraints  imposed  by  security  or  company  policy. 

DEADLOCKS 

A  good  file  allocation  scheme  results  in  efficient  utilization  of 
a  distributed  data  base  management  system.  However,  if  a  distributed 
DBMS designer ignores file allocation, it is likely that the system will
still  operate  (perhaps  in  a  very  inefficient  manner).  This  situation 
does  not  hold  for  the  deadlock  problem.  If  the  system  designer  does 
not  consider  the  possibility  of  deadlocks,  severe  problems  may  arise. 

Deadlock  in  a  DBMS  is  an  unfortunate  side  effect  of  the  need  for 
a portion of the data base to be shared by several data base tasks,
at  least  one  of  which  is  updating  the  shared  information.  For  example, 
if  a  record  is  to  be  updated  by  a  task,  it  is  necessary  that  no  other 
task  be  allowed  to  access  the  record  during  the  update  procedure.  Failure 
to  provide  a  blocking  mechanism  can  result  in  incorrect  information 
appearing  in  the  data  base.  If  there  are  portions  of  the  data  base 
that  may  be  accessed  simultaneously  by  several  tasks,  then  a  deadlock 
condition  may  occur.  Deadlock  occurs  when  two  or  more  tasks  have 
blocked  each  other  from  execution  by  locking  shared  portions  of  the 
data  base.  Figure  5  illustrates  deadlock  of  two  data  base  tasks. 
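
The circular wait described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reproduction of Figure 5: the function name find_deadlock, the two dictionaries, and the record names are hypothetical, and each task is assumed to wait for at most one record at a time.

```python
def find_deadlock(holders, requests):
    """Return the tasks forming a circular wait, or None if there is none.

    holders  maps record -> task currently holding its lock.
    requests maps task   -> record the task is blocked waiting for.
    Both structures are hypothetical and exist only for this sketch.
    """
    for start in requests:
        path = [start]
        task = start
        while task in requests:
            owner = holders.get(requests[task])
            if owner is None or owner == task:
                break  # record is free or already ours, so no wait
            if owner in path:
                # We have come back to a task already on the chain.
                return path[path.index(owner):]
            path.append(owner)
            task = owner
    return None


# Task A holds record1 and wants record2; Task B holds record2 and
# wants record1. Neither can proceed.
holders = {"record1": "A", "record2": "B"}
requests = {"A": "record2", "B": "record1"}
print(find_deadlock(holders, requests))
```

A detection scheme of this kind only reports the cycle; the system must still choose one of the reported tasks and undo its work.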

The underlying cause of deadlock in data base systems is the
organization of conventional secondary storage media. In order to
optimize the utilization of secondary storage, several requests must
be processed simultaneously. This organization tends to reduce disk
head movement and latency, which are often the limiting factors in the
performance of a data base system. The deadlock problem can be avoided
in systems which do not use conventional secondary storage. Associative
machines batch all data base requests and provide the task issuing
the request exclusive control of the data base, thus avoiding deadlock.

The deadlock problem has been studied at great length by researchers
in operating systems. The general principles for the detection or
prevention of deadlock that have been developed by these researchers
are also applicable to data base management. However, the DBMS
deadlock problem is compounded by the need to ensure that uniformly correct
data is maintained throughout the system.

For example, if in the situation depicted in Figure 5 the deadlock
is resolved by returning Task A to its starting point and releasing
its resources (this procedure is called "rollback"), then Record 1
must be restored to its condition prior to being modified by Task A.

If the locking procedure used permits retrieval while preventing
updates by other tasks, then it is possible that some Task C may have
retrieved and operated upon Record 1 in its altered state (that is,
after Task A modified it). In this situation Task C would also have to
be rolled back. Rollback of Task C may result in the necessity of
rolling back still other tasks in the system, thus causing a
substantial performance degradation.
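
The way one rollback can force others may be sketched as a simple graph traversal. The sketch assumes the system records which tasks have retrieved the uncommitted updates of which other tasks; the read_from mapping, the function name, and Tasks D and E are hypothetical and introduced only for illustration.

```python
from collections import deque


def tasks_to_roll_back(victim, read_from):
    """List every task that must be rolled back along with the victim.

    read_from maps a writer task to the set of tasks that retrieved its
    uncommitted updates (a hypothetical bookkeeping structure).
    """
    rolled = [victim]
    seen = {victim}
    queue = deque([victim])
    while queue:
        writer = queue.popleft()
        # Any task that read data written by a rolled-back task has
        # worked from invalid information and must be undone as well.
        for reader in sorted(read_from.get(writer, ())):
            if reader not in seen:
                seen.add(reader)
                rolled.append(reader)
                queue.append(reader)
    return rolled


# Task C read Record 1 after Task A changed it; Tasks D and E in turn
# read results produced by Task C.
read_from = {"A": {"C"}, "C": {"D", "E"}}
print(tasks_to_roll_back("A", read_from))  # ['A', 'C', 'D', 'E']
```

In a distributed system each edge of this graph may cross machines, so every additional task on the list adds message traffic to the cost of recovery.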

Deadlock in data base management systems has been investigated
by several researchers. However, relatively little work has
concentrated on deadlock in distributed data base management systems.
In a distributed DBMS the problem is complicated by the fact that the
interacting tasks may reside upon different machines. Therefore, the
communication overhead in the rolling back of tasks can become very
substantial. Consequently, in a distributed DBMS, a deadlock prevention
scheme may provide better overall system performance than a deadlock
detection approach.

Preventing deadlocks in a distributed DBMS is the subject of a
report by Chu and Ohlmacher. They propose two approaches to deadlock
prevention. The first method requires a data base task to indicate
its resource (file) requirements before initiation. A task is started
only if all requested resources can be assigned to the task. This
approach is very straightforward. However, a task may not access all
of its declared files in a given execution. Thus, the initiation of a
task may be needlessly delayed.
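
A minimal sketch of the first method follows. It is not Chu and Ohlmacher's algorithm as published; it assumes that every file is assigned exclusively to one task at a time, and the class name PreclaimScheduler, its methods, and the file names are hypothetical.

```python
class PreclaimScheduler:
    """Start a task only if every file it declares can be assigned to it.

    Assumption for this sketch: files are held exclusively until the
    owning task finishes.
    """

    def __init__(self):
        self.owner = {}  # file name -> task currently holding it

    def try_start(self, task, declared_files):
        # Refuse to start if any declared file is already assigned.
        if any(f in self.owner for f in declared_files):
            return False
        for f in declared_files:
            self.owner[f] = task
        return True

    def finish(self, task):
        # Release every file held by the finished task.
        self.owner = {f: t for f, t in self.owner.items() if t != task}


sched = PreclaimScheduler()
print(sched.try_start("T1", {"payroll", "personnel"}))    # True
print(sched.try_start("T2", {"personnel", "inventory"}))  # False
sched.finish("T1")
print(sched.try_start("T2", {"personnel", "inventory"}))  # True
```

The second call shows the drawback noted above: T2 is held back even if T1 never touches the personnel file during this particular execution.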

The second technique for deadlock prevention proposed by Chu
and Ohlmacher is based upon the notion of task sets. A task set is a
collection of tasks with access to common files. Whenever a task has
the need to access a file, the system determines if all files that may
be accessed by the requesting task and other members of its task set
are available. If all such files are free, the task is permitted to
proceed. Otherwise, the task must wait until its files are available.

The task sets change as tasks are initiated and terminated. Chu
and Ohlmacher present an algorithm for assigning the control of a
particular task set to a processor node in the network. All requests
for files by tasks in the set must be directed to the controlling
processor.
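
One possible reading of the task set idea is sketched below. It is an interpretation, not the published algorithm: tasks are grouped transitively whenever their declared files overlap, and a file counts as "available" when it is either free or already held by the requesting task. The names task_sets, may_proceed, may_access, and held_by are hypothetical.

```python
def task_sets(may_access):
    """Group tasks that share files, directly or through other tasks.

    may_access maps task -> set of files the task may access.
    Returns a list of (tasks, files) pairs with disjoint file sets.
    """
    sets = []
    for task, files in may_access.items():
        tasks, files = {task}, set(files)
        remaining = []
        for other_tasks, other_files in sets:
            if files & other_files:
                # A common file joins the two groups into one task set.
                tasks |= other_tasks
                files |= other_files
            else:
                remaining.append((other_tasks, other_files))
        sets = remaining + [(tasks, files)]
    return sets


def may_proceed(task, may_access, held_by):
    """True if no file of the task's set is held by another task.

    held_by maps file -> task currently holding it.
    """
    for tasks, files in task_sets(may_access):
        if task in tasks:
            return all(held_by.get(f, task) == task for f in files)
    raise KeyError(task)


may_access = {"T1": {"a", "b"}, "T2": {"b", "c"}, "T3": {"d"}}
held_by = {"c": "T2"}
print(may_proceed("T1", may_access, held_by))  # False: T2 holds file c
print(may_proceed("T3", may_access, held_by))  # True: T3 is in its own set
```

Because the grouping must be recomputed as tasks start and stop, the controlling processor for each set has to be kept informed of every change, which is the source of the communication overhead in a network.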

An analysis of the two proposed deadlock prevention algorithms
for distributed data base systems indicates that the dynamic nature
of the second technique has both benefits and drawbacks. It has
the advantage of allowing tasks to proceed until an actual file
request occurs. However, the maintenance of the task sets may require
considerable communication overhead in a network environment.

An important factor for any proposed deadlock handling algorithm
for distributed data bases is operational efficiency. Due to
the  complexity  of  distributed  data  base  systems,  it  is  difficult  to 
determine  the  efficiency  of  such  algorithms  analytically.  Therefore, 
the  practicality  of  the  treatment  of  a  deadlock  in  a  distributed  DBMS 
cannot  be  determined  until  more  distributed  data  bases  are  implemented. 


DATA  TRANSLATION 


One of the problems that faces the designer of a distributed DBMS
composed of multiple software systems on a heterogeneous network is
data incompatibility. The problem of disparate internal data
representations is complicated by different logical structures in the
data base systems. Since these differences are a fact of life in data
processing, a method of data base translation is necessary for the most
general case of a distributed network. For a network with K different
data base systems, a "brute force" translation approach is to construct
a unique translator for each pair of data base systems. The problem of
translating between two given data base systems on specified computers
is a well defined but non-trivial task. However, this approach would
require each DBMS node to have 2(K-1) translators available to map to
and from every other DBMS node in the network.

Several alternatives to the "brute force" or "K to K" data
translation approach have been considered. A major effort in this
area is the University of Michigan Data Translation Project.
Figure 6 illustrates the translation methodology developed by Fry
and his associates at Michigan. All data bases are described using
a universal Stored Data Definition Language (SDDL). The translations
are driven by tables produced by compilers for the SDDL and the
Translation Definition Language (TDL). The TDL is employed to express
the relationship between the source and target data bases. This
translation methodology requires only one translation program at each
DBMS node. However, the translator must be supplemented with an SDDL
table and (K-1) TDL tables. Birss and Fry discuss the feasibility
of the Data Translation Project methodology based upon their prototyping
experiences.

Schneider proposes a Data Specification and Conversion Language
(DSCL) for data base networks. The DSCL is a high level language for
data translation. Schneider suggests that one machine in the
distributed DBMS serve as the network translator, as shown in Figure 7.
All communication with the translation machine uses DSCL. The
translation machine effectively contains K different translation
programs. The main advantage of this technique is that the only
additional software required on the DBMS nodes is a set of utility
routines to map to and from DSCL.
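
The software burden of the three approaches can be compared with a little arithmetic. The sketch below simply restates the counts given in the text for a network of K different data base systems; the function name and the dictionary keys are hypothetical labels chosen for this illustration.

```python
def translation_requirements(k):
    """Translation software needed for k different data base systems.

    The counts restate the figures in the text; the keys are only
    labels for this sketch.
    """
    if k < 2:
        raise ValueError("translation needs at least two systems")
    return {
        # "Brute force": each node maps to and from every other node.
        "pairwise": {"translators_per_node": 2 * (k - 1)},
        # Michigan methodology: one table-driven program per node.
        "sddl_tdl": {
            "translators_per_node": 1,
            "sddl_tables_per_node": 1,
            "tdl_tables_per_node": k - 1,
        },
        # Schneider: one central machine holds the translation programs;
        # each node keeps only its DSCL utility routines.
        "central_dscl": {"programs_on_translation_machine": k},
    }


for name, counts in translation_requirements(5).items():
    print(name, counts)
```

For K = 5, each node needs 8 translators under the pairwise scheme, against one program with 1 SDDL table and 4 TDL tables under the Michigan methodology, or 5 programs held on a single translation machine under Schneider's proposal.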

The Datacomputer, which is a back-end machine in the ARPA network,
communicates with its host machines only in Datalanguage. Each host
processor accessing information controlled by the Datacomputer must
perform translations to and from the Datalanguage format. The
Datacomputer is a special case of the general data translation problem.
However, the fact that it is a viable node in the ARPA network illustrates
the feasibility of that approach.

The approaches to data translation discussed thus far are totally
automatic. Su and Lam have designed an interactive translation
system in which the user participates in composing the logical structure
of the target data base. The system requires a separate translator
for each distinct target machine. This approach provides the user with
considerable flexibility in restructuring the target data base.
However, it requires user intervention for any intermachine data base
communication. Therefore, the interactive approach is best suited for
infrequent large scale data transfers. A prototype has been
constructed by Su and Lam.

Distributed data base systems with information and control spread
over a number of computers pose many interesting security problems.
Since many data bases contain classified or private information,
thought must be given to the security of data before determining
if any benefits are to be gained by multi-computer access. The
principal security question with regard to distributed data bases is
whether a distributed DBMS is inherently more or less secure than a
single machine system. An analysis of distributed data base systems
indicates that they provide both advantages and drawbacks with respect
to security.

The security benefits arise in configurations which contain
dedicated back-end machines. Lowenthal indicates that a computer
solely dedicated to the processing of data base operations is able to
screen every data base request from the host machines. One important
security aspect of the dedicated back-end machine is that no
application programs execute on it. This eliminates the threat of a
malevolent program monitoring data base activity.

The most outstanding security liability of a distributed data
base system is the use of public communication lines in geographically
dispersed networks. An intruder using current technology can easily
monitor the transmissions between remote installations. A designer of
a secure geographically distributed DBMS must rely upon encryption
techniques to preserve security.

STATE  OF  THE  INDUSTRY 

Distributed  data  bases  are  on  the  verge  of  becoming  a  commercial 
reality.  Back-end  data  base  systems  are  presently  in  the  development 
stage  at  the  installations  of  several  hardware  and  software  vendors. 

It is difficult to report industrial progress without either
providing free publicity or revealing proprietary information. However,
it is important that the significance of distributed data base systems
be emphasized by describing the work underway in the commercial sector.
Therefore, this section contains brief descriptions of projects
currently underway at several vendors' installations with no specific
references to either the vendor or its product line.

The back-end machine has been a focal point of the industry's
thrust into distributed systems. Virtually all vendors have followed
Canaday's initial prototype by using a minicomputer as the back-end
machine. One software vendor is presently developing a back-end DBMS
targeted for a specific large mainframe host and minicomputer back-end.
Their DBMS software presently operates on both machines in a
stand-alone manner. The major developmental effort on the project is the
implementation of a communication system between the two machines.
When complete, this particular back-end DBMS will be suitable for
use with both locally and remotely connected machines.

Another software vendor is taking the data base machine approach
to distributed data base systems. This company is developing a
back-end DBMS version of their software on a minicomputer and a generalized
communication system written in an easily portable systems implementation
language. They intend to market the minicomputer as a "black box"
which can be attached by means of their communications software to
several large mainframe computers.

Hardware vendors are also actively pursuing the idea of
distributed data bases. One manufacturer is studying the feasibility
of constructing a back-end machine from a cluster of microcomputers in
a manner similar to Madnick's INFOPLEX.¹³ In the system being studied,
a data base request would be decomposed into primitives and processed
in parallel by the microcomputers. This approach is very well suited
for a relational DBMS.

Another hardware manufacturer is investigating the concept of a
microprogrammable back-end machine. This back-end processor would
be a minicomputer that could be connected via a high speed interface
to the vendor's large mainframe CPUs. Through the use of special
purpose data base instructions in firmware, a high performance
back-end machine can be developed. The vendor is also considering
configurations of multiple back-end processors.

From this small sample of the activity in the data base industry,
it can be seen that distributed data base systems, in particular
back-end machines, will soon appear in the marketplace. This is a
result of an intersection of advances in hardware and software
technology with a need for greater access to and sharing of information.

CONCLUSION

Distributed data base management systems are the focus of a large
amount of research and development activity in both the academic and
industrial environments. Distributed data base systems provide a
means of extending the capacities of computing systems to allow a
wide range of information to be accessed by many people. As is
typical of the current computer environment, the hardware technology
of distributed data bases has advanced further than the software.

Many obstacles remain before a truly general distributed DBMS
will appear. However, back-end data base systems, data base machines,
and distributed information systems are available now (or will be
available within the next year). In the next several years, advances
in distributed data base management systems should be among the most
significant in the computer field.